Makam et al. BMC Medical Informatics and Decision Making 2013, 13:81
http://www.biomedcentral.com/1472-6947/13/81 






RESEARCH ARTICLE Open Access 



Identifying patients with diabetes and the earliest 
date of diagnosis in real time: an electronic 
health record case-finding algorithm 

Anil N Makam 1 , Oanh K Nguyen 1 , Billy Moore 2 , Ying Ma 2 and Ruben Amarasingham 2,3 



Abstract 

Background: Effective population management of patients with diabetes requires timely recognition. Current 
case-finding algorithms can accurately detect patients with diabetes, but lack real-time identification. We sought 
to develop and validate an automated, real-time diabetes case-finding algorithm to identify patients with 
diabetes at the earliest possible date. 

Methods: The source population included 160,872 unique patients from a large public hospital system between 
January 2009 and April 2011. A diabetes case-finding algorithm was iteratively derived using chart review and
subsequently validated (n = 343) in a stratified random sample of patients, using data extracted from the 
electronic health records (EHR). A point-based algorithm using encounter diagnoses, clinical history, pharmacy 
data, and laboratory results was used to identify diabetes cases. The date when accumulated points reached a 
specified threshold equated to the diagnosis date. Physician chart review served as the gold standard. 

Results: The electronic model had a sensitivity of 97%, specificity of 90%, positive predictive value of 90%, and 
negative predictive value of 96% for the identification of patients with diabetes. The kappa score for agreement 
between the model and physician for the diagnosis date allowing for a 3-month delay was 0.97, where 78.4% of 
cases had exact agreement on the precise date. 

Conclusions: A diabetes case-finding algorithm using data exclusively extracted from a comprehensive EHR can 
accurately identify patients with diabetes at the earliest possible date within a healthcare system. The real-time 
capability may enable proactive disease management. 



Background 

Practice redesign efforts are shifting the paradigm from 
volume to value in healthcare in part by emphasizing care 
coordination, population health, and performance re- 
porting. To this end, the National Committee for Quality 
Assurance (NCQA) requires practices to use patient track- 
ing, disease registries and certified electronic health re- 
cords (EHR) in order to qualify for patient-centered 
medical home (PCMH) and accountable care organization 
(ACO) accreditation [1,2]. 

Diabetes is well-suited to the principles of the PCMH 
and ACO, given that it affects 25.8 million people, [3] costs
$174 billion annually, [4] and despite well-established and
effective guidelines only 45% of diabetics receive 
recommended care [5,6]. Effective practice redesign efforts 
aimed at improving the care of diabetes will require pro- 
active identification of patients with diabetes to narrow the 
knowledge-to-action gap. While existing diabetes case- 
finding algorithms are able to accurately identify patients 
with diabetes, [7-17] such algorithms rely on historical rather
than real-time data. As a result, there may be a lag between
when a patient receives a diagnosis of diabetes in the clinical
setting and when the patient is identified as a diabetic by a
case-finding algorithm for the purpose of population
management. Because preventing
complications of diabetes depends critically on timely inter- 
vention, [18] this lag impedes the potential for case-finding 
algorithms to significantly affect prevention of such
complications across diabetic populations.

To our knowledge, no published case-finding algorithm
takes advantage of a comprehensive EHR that
would allow for a fully automated and electronic system 
obviating the need for manual data entry, and import- 
antly, real-time availability for data capture and identifi- 
cation [19]. Therefore, this study aims to derive and 
validate an electronic case-finding model (e-model) that 
could be used in real-time to identify patients who meet 
criteria for diabetes at the earliest possible date based on 
EHR data within a healthcare system. 

Methods 

Study population 

The e-model was developed using historical data extracted 
from an EHR (Epic Systems Corporation, Verona, WI) 
deployed across inpatient and outpatient settings within 
Parkland Health & Hospital System (PHHS), a large urban 
safety-net health system in Dallas, TX. We used data from 
160,872 unique adult patients (age ≥ 18 years) who had a
first encounter within PHHS between January 1, 2009 and 
April 1, 2011. 

Definition of algorithm variables 

To determine the criteria used by the e-model to identify 
diabetes, we used a combination of diagnostic criteria 
from the American Diabetes Association (ADA) [5] and 
data elements identified by a panel of physician health ser- 
vice researchers, including a board certified endocrinolo- 
gist (AM, CR, RA). Only those data elements which were 
routinely documented and extractable from structured 
data fields (e.g., encounter diagnosis, past medical history, 
problem list, medications, and laboratory results) within 
the EHR were included as variables in the e-model. 

Derivation of E-model 

We used a point-based algorithm to identify the presence 
of diabetes and to determine the date that a diagnosis 
could have been made through information in the EHR. 
Each variable in the e-model was initially assigned a frac- 
tional point value, proportionate to its perceived relative 
contribution in diagnosing diabetes. Point totals of 0, between
0 and 1, and ≥ 1 were set as thresholds for e-model
determination of 'no diabetes,' 'possible diabetes,' and 'dia- 
betes,' respectively. The e-model determined the diagnosis 
date as the date when accumulated points reached or 
exceeded a threshold value of 1. 

The perceived relative contribution of each variable 
was determined based on ADA diagnostic criteria, [5] 
existing diabetes case-finding algorithms, [10] and expert 
opinion. For example, since the ADA requires two 
fasting blood glucose values of ≥ 126 mg/dL for the diagnosis
of diabetes, the presence of a single fasting blood
glucose value of ≥ 126 mg/dL was assigned a point value
of 0.5, such that two fasting blood glucose values would 



give an individual a total of 1 point for an e-model iden- 
tification of 'diabetes.' 
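
The three-way classification described above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds, not the authors' implementation; the function name `classify` and the constant `FASTING_GLUCOSE_POINTS` are hypothetical names introduced here.

```python
def classify(points: float, threshold: float = 1.0) -> str:
    """Map an accumulated point total to the three e-model categories."""
    if points >= threshold:
        return "diabetes"
    if points > 0:
        return "possible diabetes"
    return "no diabetes"


# One qualifying fasting glucose value is worth 0.5 points (see text),
# so a single value is 'possible diabetes' and two values reach the threshold.
FASTING_GLUCOSE_POINTS = 0.5  # hypothetical constant name

print(classify(0.0))
print(classify(FASTING_GLUCOSE_POINTS))
print(classify(2 * FASTING_GLUCOSE_POINTS))
```

The default threshold of 1 matches the value described in the text; it is left as a parameter because the study later varies it in a sensitivity analysis.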

Point assignments were subsequently recalibrated 
through a clinically-guided strategy, consisting of an it- 
erative, three-stage evaluation process (Figure 1). At 
each stage, a stratified random sample of up to 500 
charts, with 50% 'diabetes,' 25% 'possible diabetes,' and 
25% 'no diabetes' as determined by the e-model, under- 
went unblinded chart review by a physician to evaluate the 
accuracy of the e-model identification of diabetes and 
diagnosis date. To allow better evaluation of e-model
performance, the 'no diabetes' group was restricted to
individuals 50 years or older, since the incidence of diabetes
rises with age and an older sample is therefore more likely
to reveal false negatives. To allow for evaluation of
the accuracy of the e-model diagnosis date, 50% of the 
'diabetes' group was selected to have accumulated ≥ 1
point(s) on a date more recent than the date of the first 
encounter. This allowed for a potentially earlier date of 
diagnosis to be determined by chart review. 

Points for individual variables were reweighted after 
each stage based on commonly recurring e-model inac- 
curacies and were finalized after three successive stages 
(Table 1). 

Through the derivation process, we adjusted the point 
values for diabetes medication, problem list and past med- 
ical history, and ICD-9 encounter diagnosis. The presence 
of any diabetes medication was initially assigned a point 
value of 1, since we considered this a surrogate for the 
presence of diabetes given that the primary and often only 
indication is the treatment of hyperglycemia. However, 
during chart review, metformin was found to be occasion- 
ally prescribed for pre-diabetes and polycystic ovarian 
syndrome, and individuals with only the presence of met- 
formin were incorrectly identified by the e-model as 'dia- 
betes.' Therefore, the point value for metformin was 
decreased to 0.75. The point value for the presence of diabetes
in the past medical history or problem list fields
was reduced to 0.4 because the data in these fields were
often found to be inaccurate and outdated. Lastly, a single
ICD-9 encounter diagnosis in the absence of other vari- 
ables incorrectly identified patients as having diabetes in 
most cases. Existing case-finding algorithms have also 
found the presence of two ICD-9 codes across outpatient 
and inpatient settings to be highly sensitive and more spe- 
cific than a single code for the diagnosis of diabetes [20]. 
Therefore, the encounter diagnosis variable was adjusted 
from 1 to 0.75 points. 

Validation of E-model 

To validate the e-model, we compared the e-model iden- 
tification of 'diabetes,' 'possible diabetes' and 'no dia- 
betes' and date of diagnosis to the gold standard of 
physician chart review. Based on a conservative estimate
of 70% sensitivity +/- 10% error at a two-sided alpha of
0.05 and power of 80%, we determined that a stratified
random subset of 343 patients (50% 'diabetes,' 10% 'possible
diabetes,' and 40% 'no diabetes') was needed to adequately
validate the e-model. Charts reviewed during the derivation
process were not included in the validation cohort.

Figure 1 Electronic diabetes case-finding model derivation and validation flowchart.
Study cohort: all PHHS adult patients with a first encounter between January 1, 2009 and April 1, 2011.
Model derivation: stratified random sample of 500 charts (50% 'diabetes,' 25% 'possible diabetes,' 25% 'no diabetes'); physician review of the e-model diagnosis and e-model identified date of diagnosis; model adjustment to address systematic inaccuracies by point recalibration of individual e-model variables; repeated twice, with model validation following after 3 cycles.
Model validation: stratified random sample of 343 charts* (50% 'diabetes,' 10% 'possible diabetes,' 40% 'no diabetes'); physician review of the e-model diagnosis and e-model identified date of diagnosis; model evaluation of (1) accuracy of the e-model diagnosis (kappa between e-model and physician reviewer; sensitivity, specificity, PPV, and NPV of the e-model compared to the physician gold standard) and (2) accuracy of the e-model identified date of diagnosis (exact match, and match within 3 months, between the e-model date and the physician gold standard).
*Charts used in derivation were excluded from the validation cohort.



Blinded chart review for an initial 50 patients was 
completed by two board-certified internists working in- 
dependently (AM and ON). Inter-rater agreement be- 
tween reviewers for the classification of diabetes status 
was 0.80 with a linear weighted kappa statistic and 0.94 
for the exact diagnosis date when both reviewers agreed 
that diabetes was present (n = 19). The remaining charts
were reviewed by a single physician after establishing
adequate inter-rater reliability.

Table 1 Variables included in the electronic diabetes case-finding model

Identification variable   Criteria*      Encounter location        Point value
ICD-9 encounter code      250.xx         Inpatient or outpatient   0.75
Hemoglobin A1c            ≥ 6.5%         Inpatient or outpatient   1.00
Fasting blood glucose     ≥ 126 mg/dL    Outpatient only           0.50
Random blood glucose      ≥ 200 mg/dL    Outpatient only           0.50
2-hour OGTT               ≥ 200 mg/dL    Inpatient or outpatient   0.75
Problem list or PMH       250.xx         Inpatient or outpatient   0.40
Diabetes medication**     Present        Outpatient only           1.00
Metformin                 Present        Outpatient only           0.75

Abbreviations: ICD-9: International Classification of Diseases 9, OGTT: oral glucose tolerance test, PMH: past medical history, mg: milligram, dL: deciliter.
*Criteria are counted only once except for ICD-9 codes (maximum twice) and random and fasting blood glucose (maximum twice each) as long as repeated
glucose values are > 3 months apart.
**Insulin, sulfonylurea, thiazolidinedione, alpha glucosidase, DPP-4 inhibitor, meglitinide, amylin mimetic, incretin mimetic, combination medication.
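
As a minimal sketch of how the point values and counting rules in Table 1 could be turned into a running total and a diagnosis date, consider the following Python. It is an illustration under stated assumptions, not the study's code: the variable keys, the function name `diagnosis_date`, the storage of points in hundredths, and the reading of "> 3 months apart" as more than 90 days are all choices made here. Events are assumed to have been filtered already for the Table 1 criteria and encounter locations.

```python
from datetime import date, timedelta

# Point values from Table 1, stored in hundredths of a point so that sums
# are exact integers (an implementation choice made for this sketch).
POINTS = {
    "icd9_encounter": 75,
    "hba1c": 100,
    "fasting_glucose": 50,
    "random_glucose": 50,
    "ogtt_2h": 75,
    "problem_list_or_pmh": 40,
    "diabetes_medication": 100,
    "metformin": 75,
}
# Variables that may be counted twice; all others are counted once.
MAX_COUNT = {"icd9_encounter": 2, "fasting_glucose": 2, "random_glucose": 2}
GLUCOSE = {"fasting_glucose", "random_glucose"}
MIN_GAP = timedelta(days=90)  # assumed reading of "> 3 months apart"
THRESHOLD = 100               # 1.00 point


def diagnosis_date(events):
    """Return (total points, date the threshold was first reached or None).

    `events` is an iterable of (date, variable) pairs that already meet
    the Table 1 criteria and encounter-location rules.
    """
    total = 0
    reached = None
    counted = {}  # variable -> dates that have contributed points
    for day, variable in sorted(events):
        seen = counted.setdefault(variable, [])
        if len(seen) >= MAX_COUNT.get(variable, 1):
            continue  # this variable has already been counted in full
        if variable in GLUCOSE and seen and day - seen[-1] <= MIN_GAP:
            continue  # repeat glucose value too close to the counted one
        seen.append(day)
        total += POINTS[variable]
        if reached is None and total >= THRESHOLD:
            reached = day
    return total / 100, reached


history = [
    (date(2010, 3, 1), "problem_list_or_pmh"),
    (date(2010, 6, 15), "icd9_encounter"),
    (date(2010, 6, 15), "metformin"),
]
print(diagnosis_date(history))
```

In this hypothetical history the problem list entry alone leaves the patient at 0.4 points ('possible diabetes'), and the encounter code on June 15, 2010 brings the total past 1 point, so that date is returned as the diagnosis date.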

To examine model performance, the "possible diabetes" 
group was combined with the "no diabetes" group to allow 
for a dichotomous classification. Two sensitivity analyses 
were performed on the validation cohort to 1) determine 
the optimal point threshold for identification of diabetes 
and 2) determine the effect of an alternate dichotomous 
grouping on e-model performance. For the first sensitivity 
analysis, we varied the point threshold for the identifica- 
tion of diabetes to test for the optimal cutoff point to 
maximize e-model performance. For the second, we 
combined the "possible diabetes" group with the "dia- 
betes" group to examine the effect on e-model sensitiv- 
ity and specificity. 
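
The first sensitivity analysis can be pictured as a sweep over candidate thresholds, computing sensitivity and specificity against the physician classification at each one. The sketch below assumes a hypothetical list of point totals and physician labels invented purely for illustration; the function name `sensitivity_specificity` is also an assumption.

```python
def sensitivity_specificity(points, truth, threshold):
    """Sensitivity and specificity when totals >= threshold count as diabetes."""
    pairs = list(zip(points, truth))
    tp = sum(1 for p, t in pairs if p >= threshold and t)
    fn = sum(1 for p, t in pairs if p < threshold and t)
    tn = sum(1 for p, t in pairs if p < threshold and not t)
    fp = sum(1 for p, t in pairs if p >= threshold and not t)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical point totals and physician labels (True = diabetes).
points = [0.0, 0.4, 0.5, 0.75, 1.0, 1.15, 1.5, 2.0]
truth = [False, False, False, True, True, False, True, True]

for threshold in (0.5, 1.0, 1.5):
    sens, spec = sensitivity_specificity(points, truth, threshold)
    print(f"threshold {threshold}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

Lowering the threshold trades specificity for sensitivity, and plotting the resulting pairs across all thresholds gives a receiver operating characteristic curve.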

We then evaluated the performance of the e-model 
on correct identification of the earliest diagnosis date. 
The e-model date was considered to be in agreement 
with the physician date of diagnosis if it was within a 
3-month window of the physician-determined date. The
time interval of 3 months was chosen based on the 
ADA consensus opinion which recommends repeat gly- 
cemic testing in 3 months for newly diagnosed or poorly 
controlled diabetes patients [5]. Lastly, we compared 
the performance of the e-model in identifying the date 
of diagnosis to the performance of a simplified claims- 
based case-finding algorithm that would not require the 
presence of an EHR. This method utilized the presence 
of only two diabetes encounter codes (ICD-9 250) to 
identify the diagnosis date. 
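
The date comparison and the simplified claims-based rule described above can be sketched as follows. The text does not state how "3 months" was counted in days or how an e-model date earlier than the physician date was handled, so the 90-day window and the symmetric comparison below are assumptions, as are the function names.

```python
from datetime import date, timedelta

WINDOW = timedelta(days=90)  # assumed day count for the 3-month window


def dates_agree(model_date, physician_date, window=WINDOW):
    """True when the two dates differ by no more than `window`.

    A symmetric window is assumed here for illustration.
    """
    return abs(model_date - physician_date) <= window


def claims_based_date(icd9_dates):
    """Date of the second diabetes encounter code, or None if fewer than two."""
    ordered = sorted(icd9_dates)
    return ordered[1] if len(ordered) >= 2 else None


print(dates_agree(date(2010, 5, 1), date(2010, 3, 1)))
print(claims_based_date([date(2010, 6, 1)]))
```

A patient with only one encounter code never receives a claims-based date, which is one way such a rule can miss cases that a multi-element model would identify.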

Statistical analysis 

The diagnostic performance of the e-model on identifi- 
cation of diabetes status was evaluated first by the inter- 
rater agreement between e-model classification and 
physician classification of diabetes status using the kappa 
statistic and second, by using sensitivity, specificity, posi- 
tive predictive value (PPV), and negative predictive value 
(NPV). The optimal point threshold was evaluated using the
receiver operating characteristic (ROC) curve. We evaluated the
performance of the e-model on correct identification of the
diagnosis date by the inter-rater agreement between the 
e-model date of diagnosis and physician determination 
of date of diagnosis for individuals identified as 'diabetes' 
by the physician. 
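
For readers less familiar with these measures, the sketch below computes them from a two-by-two table of e-model versus physician classifications. The counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not the study's data. The kappa shown is the unweighted two-category form, a simplification made here.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV, NPV and unweighted Cohen's kappa."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Agreement expected by chance, from the marginal totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "kappa": (observed - expected) / (1 - expected),
    }


# Hypothetical counts: e-model positive/negative vs physician positive/negative.
metrics = diagnostic_metrics(tp=90, fp=10, fn=5, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts the observed agreement is 0.925 and the chance-expected agreement is 0.5, giving a kappa of 0.85.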

Analyses were conducted using STATA statistical soft- 
ware (version 12.0; STATA Corp, College Station, TX). 
The University of Texas Southwestern Medical Center In- 
stitutional Review Board approved the research protocol. 

Results 

Using the e-model to characterize the overall study cohort, 
the source population included 14,025 (8.7%) patients 



identified as having 'diabetes,' 1,882 (1.2%) patients as 
'possible diabetes,' and 144,965 (90.1%) patients as 'no 
diabetes.' In the overall diabetic population (n = 14,025),
the mean age was 52 years (+/- 13 years), 44% were 
Hispanic, 27% black, 22% white, 53% male, 68% had a 
primary payer of self-pay or charity, and the mean num- 
ber of healthcare encounters was 6.5. Of the 1,500 
charts for model derivation, 83 were excluded because 
of duplicated patients or age less than 18 years. Of the 
343 charts for validation, 2 were excluded from analysis 
because the chart did not exist (n = 1) or was a duplicate 
(n = 1). Patients in the derivation and validation cohorts 
were similar with respect to age, race and ethnicity, sex, 
and primary payer, but patients in the derivation group 
had a slightly greater number of encounters over a one- 
year period (Table 2). 

E-model performance on identification of diabetes 

The kappa statistic between the e-model and physician 
reviewer on the question of whether diabetes was 
present was 0.76 (p < 0.001) with 86% overall agreement.
Combining the "possible diabetes" group with
the "no diabetes" group, the sensitivity, specificity,
positive predictive value, and negative predictive value
of the e-model were 97%, 90%, 89%, and 97% respectively.
Alternatively, when the "possible diabetes" group
was combined with the "diabetes" group, the sensitivity
and specificity of the e-model were 99% and 81%
respectively.

Table 2 Baseline cohort characteristics for the electronic diabetes case-finding model*

Characteristics              Derivation     Validation     P-value**
n                            1417           341
Age, mean years (SD)         49.3 (14.4)    45.0 (14.6)    <.001
Race/ethnicity, %                                          .83
  Hispanic                   44             40
  White                      23             24
  Black                      25             27
  Other                      9              9
Male, %                      47             52             .13
Primary payer, %                                           .15
  Commercial                 18             16
  Medicare                   9              6
  Medicaid                   13             13
  Self-pay                   30             29
  Charity                    30             36
Encounters, mean no. (SD)
  All                        7.86 (9.03)    6.65 (9.06)    .01
  Primary care               2.25 (3.90)    1.99 (3.60)    .45
  Specialty care             4.31 (6.86)    3.62 (7.08)    .02
  Urgent care and ED         0.81 (2.58)    0.61 (0.89)    .48
  Inpatient                  0.49 (0.97)    0.43 (0.88)    .25

*Derivation and validation cohorts defined as per Figure 1.
**Student t-test for age; χ² tests for race/ethnicity, male sex, and primary payer; Wilcoxon rank-sum test for encounters.

The performance of the e-model by different point 
thresholds for the identification of diabetes corroborated 
using a threshold of 1 point to optimize sensitivity and 
specificity (Figure 2). 

E-model performance on earliest date of diagnosis 

The kappa score between e-model and physician on 
the date of diagnosis was 0.94 with agreement on the 
exact date in 76% of the cases. Among the cases where 
both the physician and e-model made a diagnosis of 
diabetes, only 4 observations (2.6%) were diagnosed by 
the e-model more than 3 months after the correct date 
(Figure 3). 

However, a more simplified claims-based algorithm 
that used only two ICD-9 encounter diagnoses (e.g. the 
diagnosis date equals the date of the second ICD-9 
code) did not accurately identify the diagnosis date 
compared to a physician reviewer. Using this approach, 
the kappa statistic was 0.62 with only 4.6% agreement 
on the exact date. Notably, 27% of the cases (n = 33) 
did not have a second ICD-9 encounter code during 
the study period. 

Discussion 

We developed an electronic case-finding algorithm that 
accurately identified patients with diabetes at their earli- 
est possible date within a healthcare system using data 
extracted from an EHR. The performance of our model 
in identifying patients with diabetes is comparable to 
other diabetes case-finding algorithms [10-17]. How- 
ever, the distinct advantage of our automated, real-time 
algorithm is the timely recognition of diabetes. Relying 
on only two ICD-9 encounter codes to establish the 
diagnosis date, a quarter of the cases in our cohort 
would have been missed and another 11% would have 
had a delayed diagnosis. By using multiple data ele- 
ments we were able to identify the date of diagnosis 
within three months of a physician's chart review date 
in 94% of cases, with three-quarters of cases having a
perfect date match. 

Figure 2 Receiver operating characteristic curve for the
electronic diabetes case-finding model identification of
diabetes compared to physician review by different point
thresholds (C statistic 0.98). Horizontal axis: 1 - specificity.

Achieving early glycemic control in patients with
newly diagnosed diabetes reduces the risk of microvascular
complications, myocardial infarction, and all-cause
mortality [18]. Attaining the benefits of instituting
early treatment requires timely diagnosis. In the ARIC
cohort, a population-based prospective study of middle-aged
adults, Samuels et al. found that even with an effective
screening program the median delay from the
onset of diabetes to physician diagnosis was 2.4 years,
with more than 7% of incident cases remaining undiag- 
nosed for at least 7.5 years [21]. In addition, delayed 
diagnoses are more widespread in safety-net settings 
where patients may have more fragmented, episodic 
care [22]. Real-time, automated patient identification 
and tracking can help healthcare systems close this gap 
and facilitate the delivery of timely, effective therapy at 
the point-of-care at the earliest possible date [19]. 

Improving care for diabetics is increasingly import- 
ant for healthcare systems in today's pay-for-performance 
climate. The high cost, rising prevalence, and documented 
quality gap have positioned diabetes at the forefront of
policies benchmarking performance. To qualify for financial
incentives and avoid public scrutiny, healthcare systems 
are increasingly faced with the challenge to achieve ac- 
ceptable rates in their diabetic population for targeted 
metrics such as glycated hemoglobin, low-density lipopro- 
tein, and screening for microalbuminuria. Our electronic 
case-finding algorithm, which leverages real-time data 
in the EHR, can enable proactive management of these 
quality measures. Healthcare systems may additionally 
apply this algorithm to provide feedback to providers 
on the quality of their care, generate letters to pa- 
tients, identify underperforming clinics for quality im- 
provement initiatives, link clinical decision support 
tools to inform decision making at the point-of-care, 
and risk stratify diabetic patients to direct limited re- 
sources to patients at greatest risk for developing 
complications. 

Figure 3 Comparison of the date of diagnosis of diabetes within a healthcare system as ascertained by the electronic diabetes
case-finding model and physician reviewer. Observations below and to the right of the dashed line (shaded area) are within the allowed
3-month window for agreement.

Our study has several limitations. First, as with other
registries, the limitations of miscoding and misclassification
prohibited subtype distinction between type 1, type
2, and secondary diabetes [23]. Second, due to limits in
study costs, we established an enriched prevalence of
diabetes of 50% in our validation cohort to reduce the
number of charts required for manual chart review.
While this altered the positive and negative predictive
values, the sensitivity and specificity of our algorithm 
remained unaffected and were comparable to other 
diabetes case-finding algorithms. Third, by using the 
e-model algorithm to select our validation cohort, we 
were unable to evaluate how individual data elements 
performed in identifying diabetes. Fourth, the direct 
applicability of our algorithm to other settings is un- 
known because of differences in practice style, EHR in- 
tegration across outpatient and inpatient settings, and 
EHR documentation. Systems with greater accuracy in 
EHR documentation may need to increase the relative 
weight of the problem list and past medical history 
field to maximize the model's performance. With 
proper weight adjustments we expect our algorithm to 
be suitable to a wide range of healthcare settings. Au- 
tomated machine learning techniques may provide ap- 
proaches to model adjustment that could minimize 
manual recalibration and allow larger scales of dissem- 
ination. Lastly, in clinical settings transitioning from 
paper-based records to an EHR, the e-model may not 
accurately distinguish between newly established ver- 
sus preexisting cases of diabetes within a healthcare 
system [24]. 
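
The second limitation rests on a general property: predictive values depend on prevalence, whereas sensitivity and specificity do not. The sketch below illustrates this with the reported sensitivity (97%) and specificity (90%), comparing the enriched 50% prevalence with 8.7%, the share of the source population that the e-model labeled 'diabetes.' Using that share as a prevalence is an assumption made here for illustration, and the output is arithmetic, not a result reported by the study.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Positive and negative predictive value via Bayes' rule."""
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


for prevalence in (0.50, 0.087):
    ppv, npv = predictive_values(0.97, 0.90, prevalence)
    print(f"prevalence {prevalence:.1%}: PPV {ppv:.2f}, NPV {npv:.3f}")
```

Under these assumptions the positive predictive value is considerably lower at the lower prevalence while the negative predictive value is slightly higher, which is why predictive values from an enriched sample should not be carried over directly to an unselected population.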



Conclusion 

Our electronic case-finding algorithm can accurately iden- 
tify patients with diabetes at the earliest possible date 
within a healthcare system. We believe this algorithm can 
be used by healthcare systems with comprehensive EHRs 



to build real-time diabetes identification systems. This is 
foundational to diabetes "system awareness," or building 
information systems that are able to construct and maintain
awareness of a patient's status across time, setting,
provider, and context.